Abstract


This paper addresses the question of how language use affects community reaction to comments in online discussion forums, and the relative importance of the message vs. the messenger. A new comment ranking task is proposed based on community-annotated karma in Reddit discussions, which controls for topic and timing of comments. Experimental work with discussion threads from six subreddits shows that the importance of different types of language features varies with the community of interest.


1 Introduction 


Online discussion forums are a popular platform for people to share their views about current events and learn about issues of concern to them. Discussion forums tend to specialize in different topics, and people participating in them form communities of interest. The reaction of people within a community to comments posted provides an indication of community endorsement of opinions and value of information. In most discussions, the vast majority of comments spawn little reaction. In this paper, we look at whether (and how) language use affects the reaction, compared to the relative importance of the author and timing of the post.

Early work on factors that appear to influence crowd-based judgments of comments in the Slashdot forum (Lampe and Resnick, 2004) indicates that timing, starting score, length of the comment, and poster anonymity/reputation appear to play a role (where anonymity has a negative effect). Judging by differences in popularity of various discussion forums, topic is clearly important. Evidence that language use also matters is provided by recent work (Danescu-Niculescu-Mizil et al., 2012; Lakkaraju et al., 2013; Althoff et al., 2014; Tan et al., 2014). Teasing these different factors apart, however, is a challenge. The work presented in this paper provides additional insight into this question by controlling for these factors in a different way than previous work and by examining multiple communities of interest. Specifically, using data from Reddit discussion forums, we look at the role of author reputation as measured in terms of a karma k-index, and control for topic and timing by ranking comments in a constrained window within a discussion.

The primary contributions of this work include findings about the role of author reputation and variation across communities in terms of aspects of language use that matter, as well as the problem formulation, associated data collection, and development of a variety of features for characterizing informativeness, community response, relevance and mood.


2 Data 


Reddit¹ is the largest public online discussion forum with a wide variety of subreddits, which makes it a good data source for studying how textual content in a discussion impacts the response of the crowd. On Reddit, people initiate a discussion thread with a post (a question, a link to a news item, etc.), and others respond with comments. Registered users vote on which posts and comments are important. The total number of up votes minus the down votes (roughly) is called karma; it provides an indication of community endorsement and popularity of a comment, as used in (Lakkaraju et al., 2013). Karma is valued as it impacts the order in which the posts or comments are displayed, with the high karma content rising to the top. Karma points are also accumulated by members of the discussion forum as a function of the karma associated with their comments.


1 http://www.reddit.com 















subreddit     # Posts   # Comments/Post
FITNESS          3K         16.3
ASKSCIENCE       4K          8.8
POLITICS         7K         23.7
ASKWOMEN         4K         50.5
ASKMEN           4K         58.3
WORLDNEWS       12K         26.1

Table 1: Data collection statistics.


The Reddit data is highly skewed. Although there are thousands of active communities, only a handful of them are large. Similarly, out of the more than a million comments made per day,² most receive little to no attention; the distributions of positive comment karma and author karma are Zipfian. Slightly more than half of all comments have exactly one karma point (no votes beyond the author), and only 5% of comments have less than one karma point.

For this study, we downloaded all the posts and associated comments made to six subreddits over a few weeks, as summarized in Table 1, as well as the karma of participants in the discussion.³ All available comments on each post were downloaded at least 48 hours after the post was made.⁴


2 http://www.redditblog.com/2014/12/reddit-in-2014.html
3 Our data collection is available online at https://ssli.ee.washington.edu/tial/data/reddit
4 Based on our initial look at the data, we noticed that most posts receive all of their comments within 48 hours. Some comments are deleted before we are able to download them.


3 Uptake Factors


Factors other than the language use that influence whether a comment will have uptake from the community include the topic, the timing of the message, and the messenger. These factors are all evident in the Reddit discussions. Some subreddits are more popular and thus have higher karma comments than others, reflecting the influence of topic. Comments that are posted early in the discussion are more likely to have high karma, since they have more potential responses.

Previous studies on Twitter show that the reputation of the author substantially increases the chances of a retweet (Suh et al., 2010; Cha et al., 2010), and reputation is also raised as a factor in Slashdot (Lampe and Resnick, 2004). On Reddit most users are anonymous, but it is possible that members of a forum become familiar with particular usernames associated with high karma comments. In order to see how important personal reputation is, we looked at how often the top karma comments are associated with the top karma participants in the discussion. Since an individual's karma can be skewed by a few very popular posts, we measure reputation instead using a measure we call the k-index, defined to be equal to the number of comments in each user's history that have karma ≥ k. The k-index is analogous to the h-index (Hirsch, 2005) and arguably a better indicator of extended impact than total karma.


subreddit     Top1   Top3
ASKSCIENCE     9.3   25.9
FITNESS        1.4   12.3
POLITICS       0.3    7.4
ASKWOMEN       2.2   13.5
ASKMEN         3.9   11.9
WORLDNEWS      3.1    6.4

Table 2: Percentage of discussions where the top comment is made by the top k-index person (or top 3 people) in the discussion.

The results in Table 2 address the question of whether the top karma comments always come from the top karma person. The Top1 column shows the percentage of threads where the top karma comment in a discussion happens to be made by the highest k-index person participating in the discussion; the next column shows the percentage of threads where the comment comes from any one of the top 3 k-index people. We find that, in fact, the highest karma comment in a discussion is rarely from the highest k-index people. The highest percentage is in ASKSCIENCE, where expertise is more highly valued. If we consider whether any one of the multiple comments that the top k-index person made is the top karma comment in the discussion, then the frequency is even lower.
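As an illustration of the k-index and the Top1 statistic, the following is a minimal sketch, assuming the k-index is computed exactly like the h-index (the largest k such that at least k comments have karma of at least k). The function names, the `(author, karma)` thread representation, and the toy data are hypothetical and not taken from the study.

```python
def k_index(karmas):
    """Largest k such that at least k comments have karma >= k (h-index style)."""
    k = 0
    for rank, karma in enumerate(sorted(karmas, reverse=True), start=1):
        if karma >= rank:
            k = rank
        else:
            break
    return k


def top1_rate(threads, history):
    """Percentage of threads whose top-karma comment is by the top k-index author.

    threads: list of threads, each a list of (author, karma) comments.
    history: dict mapping author -> list of karma scores of past comments.
    """
    hits = 0
    for comments in threads:
        best_author = max(comments, key=lambda c: c[1])[0]
        # Sort author names so that k-index ties are broken deterministically.
        authors = sorted({author for author, _ in comments})
        top_author = max(authors, key=lambda a: k_index(history.get(a, [])))
        hits += best_author == top_author
    return 100.0 * hits / len(threads)


# Toy data (hypothetical): one user with consistent karma, one with a single hit.
history = {"ann": [10, 8, 5, 4, 3], "bob": [100, 1, 1], "cy": [2, 2]}
threads = [[("ann", 5), ("bob", 2)], [("bob", 9), ("cy", 1)]]
print(k_index(history["ann"]), k_index(history["bob"]))  # 4 1
print(top1_rate(threads, history))  # 50.0
```

The toy user "bob" has far more total karma than "ann" but a lower k-index, which is the skew the k-index is meant to dampen.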


4 Methods 
4.1 Tasks 

Having shown that the reputation of the author of a post is not a dominating factor in predicting high karma comments, we propose to control for topic and timing by ranking a set of 10 comments that were made consecutively in a short window of time within one discussion thread according to the karma they finally received. The ranking has access to the comment history about these posts. This simulates the view of an early reader of these posts, i.e., without influence of the ratings of others, so that the language content of the post is more likely to have an impact. Very long threads are sampled, so that these do not dominate the set of lists. Approximately 75% of the comment lists are designated for training and the rest are for testing, with splits at the discussion thread level. Here, feature selection is based on mean precision of the top-ranked comment (P@1), so as to emphasize learning the rare high karma events. (Note that P@1 is equivalent to accuracy but allows for any top-ranking comment to count as correct in the case of ties.) The system performance is evaluated using both P@1 and normalized discounted cumulative gain (NDCG) (Burges et al., 2005), which is a standard criterion for ranking evaluation when the samples to be ranked have meaningful differences in scores, as is the case for karma of the comments.
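The two evaluation criteria can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the experiments: the gain function for NDCG is not specified in the text, so karma is used directly as the gain (an assumption that requires non-negative karma), with the usual logarithmic position discount. Function names are hypothetical.

```python
import math


def precision_at_1(predicted_scores, karma):
    """1.0 if the top-ranked comment has the maximum karma in the list.

    Any comment tied for the highest karma counts as correct.
    """
    top = max(range(len(karma)), key=lambda i: predicted_scores[i])
    return 1.0 if karma[top] == max(karma) else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(predicted_scores, karma):
    """NDCG with karma used directly as gain (assumption; needs karma >= 0)."""
    order = sorted(range(len(karma)), key=lambda i: predicted_scores[i], reverse=True)
    ideal = dcg(sorted(karma, reverse=True))
    return dcg([karma[i] for i in order]) / ideal if ideal > 0 else 0.0


karma = [1, 7, 7, 2]  # two comments tie for the top karma
print(precision_at_1([0.1, 0.2, 0.9, 0.3], karma))  # 1.0
print(precision_at_1([0.9, 0.1, 0.2, 0.3], karma))  # 0.0
```

Averaging `precision_at_1` over all test lists gives the mean P@1 reported per subreddit.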


In addition, for analysis purposes, we report results for three surrogate tasks that can be used in the ranking problem: i) the binary ranker trained on all comment pairs within each list, in which low karma comments dominate, ii) a positive vs. negative karma classifier, and iii) a high vs. medium karma classifier. All use class-balanced data; the second two are trained and tested on a biased sampling of the data, where the pairs need not be from the same discussion thread.


4.2 Classifier 


We use the support vector machine (SVM) rank algorithm (Joachims, 2002) to predict a rank order for each list of comments. The SVM is trained to predict which of a pair of comments has higher karma. The error term penalty parameter is tuned to maximize P@1 on a held-out validation set (20% of the training samples).
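The pairwise formulation can be sketched as below: each list of comments is turned into difference vectors labeled by which comment received more karma, and a linear model trained on those pairs then ranks a list by its score. This is a simplified illustration with hypothetical function names, not the SVM rank implementation itself; training the weight vector is left out.

```python
def pairwise_examples(features, karma):
    """Turn one comment list into (difference vector, label) training pairs.

    features: list of feature vectors (lists of floats), one per comment.
    karma: list of final karma scores, aligned with features.
    Pairs with equal karma carry no preference and are skipped.
    """
    pairs = []
    for i in range(len(karma)):
        for j in range(i + 1, len(karma)):
            if karma[i] == karma[j]:
                continue
            diff = [a - b for a, b in zip(features[i], features[j])]
            label = 1 if karma[i] > karma[j] else -1
            pairs.append((diff, label))
    return pairs


def rank_comments(weights, features):
    """Order comment indices by the linear score w . x, best first."""
    scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
karma = [3, 3, 10]
print(pairwise_examples(features, karma))
print(rank_comments([1.0, 1.0], features))
```

Because pairs are only formed within a list, comparisons never cross threads, which is how the ranking setup controls for topic and timing.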


Since much of the data includes low-karma comments, there will be a tendency for the learning to emphasize features that discriminate comments at the lower end of the scale. In order to learn features that improve P@1, and to understand the relative importance of different features, we use a greedy automatic feature selection process that incrementally adds the one feature whose resulting feature set achieves the highest P@1 on the validation set. Once all features have been used, we select the model with the subset of features that obtains the best P@1 on the validation set.
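The greedy procedure described above can be sketched as follows. The `evaluate` callable is a placeholder (an assumption) for training a ranker on the given features and returning its validation P@1; the toy scoring function in the usage lines is purely illustrative.

```python
def greedy_forward_selection(candidates, evaluate):
    """Greedy forward selection over feature names.

    candidates: iterable of feature (or feature-group) names.
    evaluate: callable mapping a list of names to a validation score
              (for example P@1 of a ranker trained on those features).
    Returns the best-scoring prefix of the greedy ordering and its score.
    """
    remaining = list(candidates)
    selected, best_subset, best_score = [], [], float("-inf")
    while remaining:
        # Add the single feature whose inclusion gives the highest score.
        score, choice = max(
            ((evaluate(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        selected.append(choice)
        remaining.remove(choice)
        # Remember the best subset seen along the greedy path.
        if score > best_score:
            best_subset, best_score = list(selected), score
    return best_subset, best_score


# Toy scoring function (hypothetical): feature "c" hurts validation performance.
value = {"a": 0.3, "b": 0.2, "c": -0.1}
subset, score = greedy_forward_selection(value, lambda fs: sum(value[f] for f in fs))
print(subset)  # ['a', 'b']
```

Continuing until every feature has been added, and only then picking the best prefix, matches the description in the text and avoids stopping at the first step that fails to improve.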


4.3 Features 


The features are designed to capture several key attributes that we hypothesize are predictive of comment karma, motivated by related work. The features are categorized in groups as summarized below, with details in supplementary material.

• Graph and Timing (G&T): A baseline that captures discourse history (response structure) and comment timing, but no text content.

• Authority and Reputation (A&R): K-index, whether the commenter was the original poster, and in some subreddits “flair” (display next to a comment author’s username that is subject to a cursory verification by moderators).

• Informativeness (Info.): Different indicators suggestive of informative content and novelty, including various word counts, named entity counts, urls, and unseen n-grams.

• Lexical Unigrams (Lex.): Miscellaneous word class indicators, punctuation, and part-of-speech counts.

• Predicted Community Response (Resp.): Probability scores from surrogate classification tasks (reply vs. no reply, positive vs. negative sentiment) to measure the community response of a comment using bag-of-words predictors.

• Relevance (Rel.): Comment similarity to the parent, post and title in terms of topic, computed with three methods: i) a distributed vector representation of topic using a non-negative matrix factorization (NMF) model (Xu et al., 2003), ii) the average of skip-gram word embeddings (Mikolov et al., 2013), and iii) word set Jaccard similarity (Strehl et al., 2000).

• Mood: Mean and std. deviation of sentence sentiment in the comment; word list indicators for politeness, argumentativeness and profanity.

• Community Style (Comm.): Posterior probability of each subreddit given the comment using a bag-of-words model.
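As a concrete example of the simplest relevance feature in the list above, word set Jaccard similarity between a comment and its parent can be sketched as follows. Lowercasing and whitespace tokenization are assumptions made for this illustration; the tokenization used in the experiments is not described here.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 0.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)


print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```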

The various word lists are motivated by feature exploration studies in surrogate tasks. For example, projecting words to a two dimensional space of positive vs. negative and likelihood of reply showed that self-oriented pronouns were more likely to have no response and second-person pronouns were more likely to have a negative response. The politeness and argumentativeness/profanity lists are generated by starting with hand-specified seed lists used to train an SVM to classify word embeddings (Mikolov et al., 2013) into these categories, and expanding the lists with 500 words farthest from the decision boundary.


Figure 1: Relative improvement in P@1 over G&T for individual feature groups.

Both the NMF and the skip-gram topic models use a cosine distance to determine topic similarity, with 300 as the word embedding dimension. Both are trained on approximately 2 million comments in high karma posts taken across a wide variety of subreddits. We use topic models in various measures of comment relevance to the discussion, but we do not use topic of the comment on its own since topic is controlled for by ranking within a thread.
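The embedding-based relevance measure can be sketched as the cosine similarity between averaged word vectors of a comment and of its parent (or post, or title). The sketch below assumes a plain dictionary of word vectors and skips out-of-vocabulary words; the 2-dimensional toy vectors stand in for the 300-dimensional embeddings, and all names are hypothetical.

```python
import math


def average_embedding(words, embeddings):
    """Mean of the vectors for the words that have an embedding, or None."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Toy 2-dimensional embeddings (hypothetical values).
toy = {"run": [1.0, 0.0], "lift": [0.0, 1.0], "gym": [1.0, 1.0]}
comment = average_embedding(["run", "lift", "unknown"], toy)
parent = average_embedding(["gym"], toy)
print(cosine_similarity(comment, parent))
```

Cosine distance is one minus this similarity; either form carries the same information as a feature.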

5 Ranking Experiments 

We present three sets of experiments on comment karma ranking, all of which show very different behavior for the different subreddits. Fig. 1 shows the relative gain in P@1 over the G&T baseline associated with using different feature groups. The importance of the different features reflects the nature of the different communities. The authority/reputation features help most for ASKSCIENCE, consistent with our k-index study. Informativeness and relevance help all subreddits except ASKMEN and WORLDNEWS. Lexical, mood and community style features are useful in some cases, but hurt others. The predicted probability of a reply was least useful, possibly because of the low-karma training bias.

Tables 3 and 4 summarize the results for the P@1 and NDCG criteria using the greedy selection procedure (which optimizes P@1) compared to a random baseline and the G&T baseline. The random baseline for P@1 is greater than 10% because of ties. The G&T baseline results show that the graph and timing features alone obtain 21-32% of top karma comments depending on subreddits. Adding the textual features gives an improvement in P@1 performance over the G&T baseline for all subreddits except ASKMEN and WORLDNEWS. The trends for performance measured with NDCG are similar, but the benefit from textual features is smaller. The results in both tables show different ways of reporting performance of the same system, but the system has been optimized for P@1 in terms of feature selection. In initial exploratory experiments, this seems to have a small impact: when optimizing for NDCG in feature selection we obtain 0.61 vs. 0.60 with the P@1-optimized features.


subreddit     Random   G&T     All
ASKSCIENCE     15.9    21.8    25.3
FITNESS        19.4    22.1    27.3
POLITICS       18.5    24.7    26.4
ASKWOMEN       17.6    24.9    28.0
ASKMEN         18.2    31.4    29.1
WORLDNEWS      15.4    24.5    23.3
Improvement      -     42.9%   52.1%

Table 3: Test set precision of top one prediction (P@1) performance for specific subreddits.


subreddit     Random   G&T     All
ASKSCIENCE     0.53    0.60    0.60
FITNESS        0.57    0.61    0.62
POLITICS       0.55    0.61    0.62
ASKWOMEN       0.56    0.62    0.65
ASKMEN         0.56    0.66    0.66
WORLDNEWS      0.54    0.61    0.60
Improvement      -     12.5%   13.2%

Table 4: Test set ranking NDCG performance for specific subreddits.
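The Improvement row is not defined in the text. As a hedged reading, the Table 3 values are consistent with the mean, across subreddits, of each system's relative improvement over the random baseline. The short check below uses the P@1 figures from Table 3; the interpretation and the function name are assumptions.

```python
random_p1 = {"ASKSCIENCE": 15.9, "FITNESS": 19.4, "POLITICS": 18.5,
             "ASKWOMEN": 17.6, "ASKMEN": 18.2, "WORLDNEWS": 15.4}
gt_p1 = {"ASKSCIENCE": 21.8, "FITNESS": 22.1, "POLITICS": 24.7,
         "ASKWOMEN": 24.9, "ASKMEN": 31.4, "WORLDNEWS": 24.5}
all_p1 = {"ASKSCIENCE": 25.3, "FITNESS": 27.3, "POLITICS": 26.4,
          "ASKWOMEN": 28.0, "ASKMEN": 29.1, "WORLDNEWS": 23.3}


def mean_relative_improvement(system, baseline):
    """Average over subreddits of the percentage gain of system over baseline."""
    gains = [100.0 * (system[s] / baseline[s] - 1.0) for s in baseline]
    return sum(gains) / len(gains)


print(round(mean_relative_improvement(gt_p1, random_p1), 1))   # 42.9
print(round(mean_relative_improvement(all_p1, random_p1), 1))  # 52.1
```

The same calculation on the two-decimal NDCG values of Table 4 gives figures close to, but not exactly, the reported 12.5% and 13.2%, presumably because of rounding in the table entries.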

A major challenge with identifying high karma comments (and negative karma comments) is that they are so rare. Although our feature selection tunes for high rank precision, it is possible that the low-karma data dominate the learning. Alternatively, it may be that language cues are mainly useful for distinguishing the negative or mid-level karma comments, and that the very high karma comments are a matter of timing. To better understand the role of language for these different types, we trained classifiers on balanced data for positive vs. negative karma and high vs. mid levels of karma. For these models, the training pairs could come from different threads, but topic is controlled for in that all topic features are relative (similarity to original post, parent, etc.). We compared the results to the binary classifier used in ranking, where all pairs are considered. In all three cases, random chance accuracy is 50%.


subreddit     Pos/Neg   High/Mid   Ranking
ASKSCIENCE     44.5       63.7       61.3
FITNESS        74.7       43.9       57.5
POLITICS       95.5       59.1       58.0
ASKWOMEN       82.5       67.6       59.7
ASKMEN         87.0       66.2       60.6
WORLDNEWS      93.3       69.9       57.3
Average        79.6       61.7       59.1

Table 5: Accuracy of binary classifiers trained on balanced data to distinguish: positive vs. negative karma (Pos/Neg), high vs. mid-level karma (High/Mid), and ranking between any pair (Ranking).

Table 5 shows the pairwise accuracy of these classifiers. We find that distinguishing positive from negative classes is fairly easy, with the notable exception of the more information-oriented subreddit ASKSCIENCE. Averaging across the different subreddits, the high vs. mid task is slightly easier than the general ranking task, but the variation across subreddits is substantial. The high vs. mid distinction for FITNESS falls below chance (likely overtraining), whereas it seems to be an easier task for ASKWOMEN, ASKMEN, and WORLDNEWS.


6 Related Work 


Interest in social media has grown rapidly in recent years, including work on predicting the popularity of posts, comments and tweets. Danescu-Niculescu-Mizil et al. (2012) investigate phrase memorability in movie quotes. Cheng et al. (2014) explore prediction of information cascades on Facebook. Weninger et al. (2013) analyze the hierarchy of Reddit discussions, topic shifts, and popularity of the comment, using, among other things, very simple language analysis. Lampos et al. (2014) study the problem of predicting a Twitter user impact score (determined by combining the numbers of a user's followers, followees, and listings) using text-based and non-textual features, showing that performance improves when user participation in particular topics is included.

Most relevant to this paper are studies of the effect of language in popularity predictions. Tan et al. (2014) study how word choice affects the popularity of Twitter messages. As in our work, they control for topic, but they also control for the popularity of the message authors. On Reddit, we find that celebrity status is less important than it is on Twitter since on Reddit almost everyone is anonymous. Lakkaraju et al. (2013) study how timing and language affect the popularity of posting images on Reddit. They control for content by only making comparisons between reposts of the same image. Our focus is on studying comments within a discussion instead of standalone posts, and we analyze a wide variety of language features. Althoff et al. (2014) use deeper language analysis on Reddit to predict the success of receiving a pizza in the Random Acts of Pizza subreddit. To our knowledge, this is the first work on ranking comments in terms of community endorsement.


7 Conclusion 

This paper addresses the problem of how language affects community reaction to comments on Reddit. We collect a new dataset from six subreddit discussion forums. We introduce a new task of ranking comments based on karma in Reddit discussions, which controls for topic and timing of comments. Our results show that using language features improves performance on the comment ranking task in most of the subreddits. Informativeness and relevance are the most broadly useful feature categories; reputation matters for ASKSCIENCE, and other categories could either help or hurt depending on the community. Future work involves improving the classification algorithm by using new approaches to learning about rare events.